Abstract

Background: Clostridium autoethanogenum strain JA1-1 (DSM 10061) is an acetogen capable of fermenting CO, 
CO2 and H2 (e.g. from syngas or waste gases) into biofuel ethanol and commodity chemicals such as 2,3-butanediol.
A draft genome sequence consisting of 100 contigs has been published. 

Results: A closed, high-quality genome sequence for C. autoethanogenum DSM 10061 was generated using only
the latest single-molecule DNA sequencing technology and without the need for manual finishing. It is assigned to
the most complex genome classification based upon genome features such as repeats, prophage and nine copies of
the rRNA gene operons. It has a low G + C content of 31.1%. Illumina, 454 and Illumina/454 hybrid assemblies were
generated and then compared to the draft and PacBio assemblies using summary statistics, the CGAL, QUAST and
REAPR bioinformatics tools, and comparative genomic approaches. Assemblies based upon shorter-read DNA
technologies were confounded by the large number of repeats and their size, which in the case of the rRNA gene
operons was ~5 kb. CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) systems among
biotechnologically relevant Clostridia were classified and related to plasmid content and prophages. Potential
associations between plasmid content and CRISPR systems may have implications for historical industrial scale
Acetone-Butanol-Ethanol (ABE) fermentation failures and future large scale bacterial fermentations. While
C. autoethanogenum contains an active CRISPR system, no such system is present in the closely related Clostridium
ljungdahlii DSM 13528. A common prophage inserted into the Arg-tRNA shared between the strains suggests a
common ancestor. However, C. ljungdahlii contains several additional putative prophages and it has more than
double the amount of prophage DNA compared to C. autoethanogenum. Other differences include important
metabolic genes for central metabolism (such as an additional hydrogenase and the absence of a phosphoenolpyruvate
synthase) and substrate utilization pathways (mannose and aromatics utilization) that might explain phenotypic
differences between C. autoethanogenum and C. ljungdahlii.

Conclusions: Single molecule sequencing will be increasingly used to produce finished microbial genomes. The 
complete genome will facilitate comparative genomics and functional genomics and support future comparisons 
between Clostridia and studies that examine the evolution of plasmids, bacteriophage and CRISPR systems. 
Background

The development of next-generation DNA sequencing 
technologies since the first human genome sequence was 
completed has led to remarkable increases in sequencing 
efficiency on the order of approximately 100,000-fold [1]. 
Costs have dropped dramatically and computational me- 
thods have advanced along with sequencing technology, 
leading to large increases in DNA sequencing output and 
in the number of available genome sequences [2,3]. A 
variety of assembly algorithms and methods for quality 
evaluation have been developed [3-13]. However, the 
majority of sequenced genomes are incomplete due to 
technical difficulties, time, and expense leading to an 
increasing disparity between the number of finished and 
draft genomes in databases [1-3,5,14]. 

The PacBio sequencing system [15] is the only long- 
read, single-molecule sequencer available at present and 
the performance of the PacBio RS system was compared 
to two short-read sequencing platforms also released in 
2011 [16]. The original RS system with C1 chemistry generated
mean read-lengths in the range of 1,500 bp and
yielded approximately 100 Mb of sequence data per run, 
and reads in this range were useful in generating improved 
scaffolds for de novo assemblies. However, the original sys- 
tem was not optimal for de novo assembly applications 
[16] and hybrid assembly approaches have been developed 
to overcome limitations in short-read technologies and 
higher error rates associated with third-generation tech- 
nology [17,18]. 

Repetitive stretches of DNA are abundant and are one 
of the main technical challenges that hinder accurate se- 
quencing and genome assembly efforts [1]. In the case of 
bacteria, the rRNA gene operon is often the largest region 
of repetitive sequence and ranges in size between 5 and
7 kb [19]. Last year, the longest PacBio RS reads were re- 
ported as being approximately 14 kb and these longer 
reads are useful in resolving repeats during genome as- 
semblies [3]. The PacBio RS II system was released last 
year and it produces more and longer reads. In a recent 
study, the longest read before correction was 15,634 bp 
and the genomes of six bacteria were sequenced and as- 
sembled using single-molecule sequencing based on C2 
chemistry [14]. Koren et al. [14] suggested that the
majority of bacterial genomes could be assembled into
finished-grade quality, that is, without gaps, and with
data derived from a single PacBio sequencing library per
sample [14]. The combination of the longer reads, depth
of coverage and random nature of sequencing errors
facilitates de novo assemblies for microbial isolates 
[15,20,21]. The advantages of single-molecule sequencing
have been discussed [22]. To date, relatively few
genome sequences have been determined exclusively
using single-molecule technology and only a handful 
represent finished genomes [14,20,21,23-25]. 

In this study, a finished genome sequence for Clostridium 
autoethanogenum strain JA1-1 (DSM 10061) was gene- 
rated using the latest PacBio RS II instrument. This re- 
presents one of the first de novo genomes finished into a 
single contiguous sequence using RS II data alone (that is, 
without addition of other next-generation sequence data 
or manual finishing steps). To offer insights into this tech- 
nology, the PacBio assembly was compared to assemblies 
based on 454 GS FLX Titanium and Illumina MiSeq data 
and an earlier draft genome sequence of 100 contigs for
this strain obtained from 454 GS FLX Titanium and Ion 
Torrent data [26]. 

C. autoethanogenum is an anaerobic, Gram-positive, 
mesophilic, acetogenic bacterium isolated using carbon 
monoxide (CO) [27]. Other substrates include the greenhouse
gas CO2 plus H2, pyruvate, xylose, arabinose, fructose,
rhamnose, and L-glutamate. There is significant
biotechnological interest in this organism as well as other 
acetogenic bacteria for their ability to use gases containing 
CO, H2 and CO2 as the sole source of carbon and energy
for the production of fuel and chemicals at scale. The 
ability to use these gases in fermentative processes enables 
acetogens to potentially provide a route to more sustainable 
fuel and chemical production from a range of feedstocks in- 
cluding biomass and municipal solid waste-derived syngas, 
reformed biogas and industrial waste gases derived, for ex- 
ample, from steel production facilities [28-33]. 

Results and discussion 

Sequencing output and assembly statistics for 
C. autoethanogenum DSM 10061 

Sequencing statistics show that for each platform a large 
number of raw reads were attained that resulted in high 
degrees of genome coverage (Table 1). Raw Illumina data 
were trimmed and filtered before assembly, but in the case 
of the 454 and PacBio assemblers raw instrument output 
files were used. Bruno-Barcena et al. used a combination
of 454 GS FLX Titanium and Ion Torrent Personal Gen- 
ome Machine (PGM) data to generate a genome reported 
as 4.5 Mb for C. autoethanogenum DSM 10061 [26]. The 



Table 1 Sequencing statistics

Library | Number of reads | Total bases | Mean read length (bp) | Longest read (bp) | Coverage (x)
454 3-kb PE | 511,515 | 202,048,425 | 395 | 945 | 46
Illumina PE | 3,689,644 | 553,446,600 | 151 | 151 | 127
PacBio | 122,933 | 782,530,012 | 6,366 | 26,777 | 179
number of 454 reads (452,052) and genome coverage
(39x) from the earlier study were similar to those in this one
(Table 1), although addition of the PGM reads resulted in 
905,738 raw reads being used to generate the preliminary 
assembly by Newbler (version 2.6). The record [GenBank: 
ASZX00000000.1] for strain DSM 10061 draft genome is 
reported as 4,323,309 bp. 
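
The coverage values in Table 1 can be related to the total bases with a short calculation. The sketch below is illustrative only: it assumes that fold coverage is simply total sequenced bases divided by genome size, and it uses the finished chromosome size reported in this study (4,352,205 bp) as the denominator, which may not be the exact value used to prepare the table. Under that assumption the quotients truncate to the 46x, 127x and 179x listed in Table 1.

```python
# Illustrative sketch: fold coverage = total sequenced bases / genome size.
# Assumption: the finished chromosome size from this study is the denominator.
GENOME_SIZE_BP = 4_352_205

# Total bases per library, copied from Table 1.
total_bases = {
    "454 3-kb PE": 202_048_425,
    "Illumina PE": 553_446_600,
    "PacBio": 782_530_012,
}


def fold_coverage(bases, genome_size=GENOME_SIZE_BP):
    """Return the average number of times each genome position is sequenced."""
    return bases / genome_size


coverage = {name: fold_coverage(bases) for name, bases in total_bases.items()}

for name, cov in coverage.items():
    print(f"{name}: {cov:.1f}x")
```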

In this study, Newbler (version 2.8) was used to assem- 
ble new 454 paired-end reads from a 3-kb insert length 
library (Table 1) into a draft genome sequence that con- 
sisted of 32 contigs (Table 2). The lower number of con- 
tigs (32 versus 100) from the new 454-only assembly 
compared to the draft version [26] is likely due to dif- 
ferences in library types (paired-end versus shotgun) and 
software versions. Assembly of Illumina-only data was 
conducted using the SPAdes [34], Velvet [35], Abyss [36] 
and the CLC Genomics Workbench (CLC Bio) assemblers 
and the best results were obtained by the Velvet assembler 
(Table 2). Previously, we have assembled genome se- 
quences for a range of bacteria using a combination of 
454 and Illumina technologies, whereby initial Illumina 
consensus sequences were shredded into 1.5-kbp over- 
lapped fake reads and assembled together with the 454 
data [37-42]. The best genome assembly obtained for 
strain DSM 10061 using second generation sequencing 
technologies employed such a hybrid approach, which is 
reflected in the lowest number of contigs, the largest sin- 
gle contig and highest N50 value (Table 2). Preliminary 
studies using the Clostridium ljungdahlii DSM 13528 gen- 
ome as a reference and a PCR/Sanger sequencing strategy 
showed contigs could be joined by such an approach 
(Additional file 1). As manual finishing is time consuming,
the potential of PacBio data to generate finished microbial 
genome sequences was assessed. 

Remarkably, one PacBio library preparation and two single-molecule
real-time (SMRT) sequencing cells produced
sufficient sequence such that it could be assembled into 
one contiguous DNA fragment that represented the DSM 
10061 genome. The PacBio genome assembly is a similar 
size to the other assemblies (Tables 1 and 2) and genome 
completeness was confirmed by sequence wrap-around. 
This is one of the first de novo sequenced genomes we are 
aware of that has been closed without manual finishing or
additional data, despite the complexity of the C. autoethanogenum
genome.

A comparison of the 454/Illumina hybrid assembly to 
the PacBio assembly showed there were small regions of 
overlap in the hybrid assembly that weakly joined contigs, 
and were supported by PCR and Sanger data, but there 
was insufficient support for the Newbler software to join 
them (Additional file 1A). PCR and Sanger data joined 
small gaps between contigs (for example, see Additional 
file 1B) in line with predictions using C. ljungdahlii DSM
13528 as a reference but in other examples much larger 
products were obtained compared to the predicted PCR 
product sizes (Additional file 1C). Other challenges in
manual finishing with a related but different species or strain
as the reference included instances of software not being
able to design PCR primers, not obtaining PCR products,
and instances of obtaining multiple PCR products of different
sizes and/or DNA smears.

Assembly quality assessments and comparisons 

The complexity of the C. autoethanogenum DSM 10061 
genome sequence was assessed and it is classified as a 
class III genome, according to previously described cri- 
teria for repeat sequence content and type [14]. Class III 
genomes are defined as containing repeats that can in- 
clude rRNA gene operons, many mid-scale repeats, such 
as insertion sequences and simple sequence repeats, and 
large phage-mediated repeats, duplications, or large tan- 
dem arrays that are considerably larger than the rRNA 
gene operon. 

PacBio sequencing technology has a high error rate, 
which has been reported as being approximately 18% [3]. 
Due to the random nature of the error [15], it is, however,
possible to get a highly accurate consensus sequence when
there is high coverage [14,20,21]. For genomes such as 
C. autoethanogenum with extreme guanine and cytosine 
(G + C) contents (31.1 mol% G + C content) and long 
homonucleotide stretches this provides an advantage over 
other sequencing technologies. 
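
Why random errors average out at high coverage can be illustrated with a toy calculation. The sketch below is not the consensus algorithm used by the PacBio software; it assumes a deliberately simplified and pessimistic model in which each read is wrong at a given position independently with probability p (here the approximately 18% quoted above), every erroneous read agrees on the same wrong call, and the consensus is a simple majority vote over an odd number of reads. Real single-molecule errors are dominated by insertions and deletions, and real consensus callers use alignments and quality models.

```python
from math import comb


def majority_error(p, n):
    """Probability that a majority vote over n independent reads is wrong.

    Toy model: each read errs with probability p, and the vote fails only
    when more than half of the reads are wrong at the position.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number to avoid ties")
    k_min = n // 2 + 1  # smallest number of wrong reads that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    for depth in (1, 3, 11, 51):
        print(f"coverage {depth:>2}: consensus error ~ {majority_error(0.18, depth):.3g}")
```

Under these assumptions the consensus error falls steeply with depth, which is the qualitative point made in the text; the absolute numbers should not be read as predictions for real data.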

Beyond simple metrics, such as contig number, N50 and 
largest contig size, several bioinformatics approaches have 
been developed to assess assembly quality. The compu- 
ting genome assembly likelihoods (CGAL) method is one 



Table 2 Assembly statistics for strain DSM 10061

Assembly | Contigs (number) | Largest contig (bp) | Contig N50 (bp) | Genome size (Mb) | Scaffolds | Largest scaffold (bp) | Scaffold N50 (bp) | Assembler
454/Ion Torrent* | 100 | 436,795 | 115,901 | 4.32 | - | - | - | Newbler 2.6
Illumina only | 57 | 460,940 | 255,482 | 4.3 | 53 | 769,812 | 328,660 | Velvet 1.2
454 only | 32 | 134,546 | 330,116 | 4.3 | 13 | 1,137,876 | 898,466 | Newbler 2.8
Illumina/454 hybrid | 22 | 1,137,625 | 687,076 | 4.3 | 13 | 1,137,625 | 899,926 | Newbler 2.8
PacBio | 1 | 4,352,205 | 4,352,267 | 4.3 | 1 | 4,352,267 | 4,352,267 | SMRT 2.0

*Previously published as a 4.5-Mb draft genome [26], but present [GenBank: ASZX00000000.1] as 4,323,309 bp.
recent approach that assesses uniformity of read coverage
for assemblies and also evaluates the read errors, library
insert size distribution and the degree of unassembled
data [13]. At present, CGAL is only able to utilize Illumina
reads for its assembly assessment and, using Illumina reads,
it ranked the assemblies from best to worst as
Illumina only, Illumina/454 hybrid, 454, published draft
and PacBio (Additional file 2). The CGAL likelihood
principle is based on the possibility that a read is
produced from every single location in the assembly. Regions
of repetitive DNA could be sequenced by longer
reads but were at times not resolved by the Illumina
reads (Figure 1), and this may have contributed to the
lower CGAL scores for assemblies that contained longer
reads and no Illumina data. QUAST [12], which used the
PacBio assembly as the reference, ranked the Illumina/454 
hybrid, 454, published draft, and Illumina only assemblies 
in the order of best to worst, respectively and details are 
provided (Additional file 3). 
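
For readers unfamiliar with the N50 metric reported in Table 2, a minimal sketch follows. It assumes the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length); the function name and the example contig lengths are hypothetical and are not taken from any assembly in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    example_contigs = [2, 3, 4, 5, 6]  # hypothetical lengths, arbitrary units
    print("N50 of example contigs:", n50(example_contigs))
    print("N50 of a single closed contig:", n50([4_352_205]))
```

For an assembly closed into a single contig, N50 equals the contig length, which is why contig number, largest contig and N50 all converge for a finished genome.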

The REAPR tool (recognising errors in assemblies using paired
reads) for genome assembly evaluation [11] detected
no collapsed repeats in the PacBio assembly,
five in the hybrid assembly and four in each of the other
assemblies (Additional file 4). The fragment coverage
distribution (FCD) error detected by REAPR in the PacBio
assembly was at location 3872494 to 3873407 (913 bp).
This region contains an rRNA gene operon and had very
low Illumina coverage (40x compared to the average
of 127x). Hence, REAPR reported an error (based on
Illumina reads only). Even 454 coverage was low in this region
(19x compared to the average of 46x). However,
PacBio read coverage of this 913-bp region was 108x and,
for the first 392 bp, there was also high-quality Sanger sequence
support, indicating it is unlikely that there is an
issue for the PacBio assembly in this region. The hybrid
and PacBio assemblies contained the fewest warnings (83
and 96, respectively), followed by the Illumina assembly
(182), and the published draft assembly contained the
most (190).

A multiple genome alignment was conducted by alig- 
ning contigs from the different assemblies to the PacBio 
reference assembly to identify conserved regions and to 
evaluate gaps in the different DSM 10061 assemblies. Re- 
gions with no or partial 454 or Illumina contig coverage 
predominantly contained predicted rRNA gene operons 
and other duplicated genes (Figure 1 and Table 3). While 
the draft genome sequence for strain DSM 10061 predicts 
one copy of the 16S rRNA gene [26], nine rRNA clusters 
were predicted using the DSM 10061 PacBio assembly, 
which is the same number of rRNA operons as in the 
closely related C. ljungdahlii DSM 13528 [28]. Based on 
findings in this study and earlier ones [1,3,14], the large 
number of DSM 10061 rRNA clusters and their repetitive 
nature confounded assembly of the shorter reads. 



The latest PacBio RS II SMRT cells are designed to 
select for larger read-lengths when long insert libraries 
(10 to 20 kb) are being prepared; however, preferential
loading of smaller fragments can still occur and this limits 
sequence output. In this study, smaller fragments were re- 
moved from the PacBio library by size exclusion leading 
to longer read-lengths and greater amounts of sequence 
data than otherwise might have been attained. The long 
reads produced by the new PacBio RS II system, combined 
with sequence depth meant that the principal regions of 
complexity could be resolved using one library prepar- 
ation and two SMRT cells to generate a complete genome 
sequence. The application of long, single-molecule se- 
quencing data will lead to a greater number of finished 
genomes and quality improvements in microbial genome 
databases [14]; however, the application of the newest version
of this technology requires more evaluation before its
full potential can be assessed for complex genomes. 

General features of the C. autoethanogenum genome, its 
metabolism and comparison to C. ljungdahlii 

The finished genome of C. autoethanogenum DSM 10061
consists of one chromosome of 4,352,205 bp with a
G + C content of 31.1 mol% and contains 89 RNA genes
(Additional file 5). Of the 4,161 genes predicted for this
strain, 4,042 are protein-coding genes (CDSs) and 18 are
pseudogenes. The distribution of genes into COG functional
categories is presented (Additional file 6). The previously
published draft DSM 10061 genome annotation
included 4,135 predicted coding sequences [26] and the
related finished C. ljungdahlii DSM 13528 genome, which
is 277,860 bp larger, contained 4,184 protein-coding
genes [28]. Predicted gene content differences reflect the
use of different gene-calling algorithms, the fact that draft
sequences can split genes in two, and genotypic differences.
The methodology, accuracy, and specificity of the Prodigal 
gene prediction algorithm used in this study has been de- 
scribed previously [43]. 
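
The G + C content quoted above is a simple base-composition statistic. As a minimal sketch, assuming that G + C content is the percentage of unambiguous bases (A, C, G, T) that are G or C, it could be computed as follows; the function name and the toy sequences are hypothetical, and ambiguous characters such as N are ignored by assumption.

```python
def gc_content(sequence):
    """Return G + C content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / (gc + at)


if __name__ == "__main__":
    toy = "ATGCATTTAAGC"  # hypothetical sequence, not from the genome
    print(f"G + C content: {gc_content(toy):.1f}%")
```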

Phenotypic and metabolic differences have been reported 
for C. autoethanogenum and C. ljungdahlii [27,44-47]. The 
two are indistinguishable at the 16S rRNA gene level [48] 
and have high scores for similarity based on in silico aver- 
age nucleotide identity comparisons across the genomes 
(0.9977 ANIb) [26]. To evaluate potential coding sequence 
differences between the two organisms, OrthoMCL [49], a 
genome-scale algorithm for grouping orthologous protein 
sequences, was used to compare all the C. autoethano- 
genum proteins to those in C. ljungdahlii and for the recip- 
rocal evaluation (Additional file 7). A general description 
for all OrthoMCL proteins is provided (Additional file 7:
Table S1). Putative paralogs were identified (1_taxa tab
in Additional file 7) along with putative orthologs
(2_taxa tab in Additional file 7). Proteins without
orthologs or paralogs were identified using the default 
Figure 1 Comparison of DSM 10061 genome assemblies. The orange colored ring represents the PacBio assembly. The next inner ring
represents the genes encoded on positive and negative strands, respectively, color coded by Clusters of Orthologous Groups (COG)
categories. The 454/Illumina hybrid assembly and published draft assembly are represented as yellow and green circles, respectively. The next three
rings represent the raw-read coverage from PacBio, 454 and Illumina technology, respectively. The gaps in the 454/Illumina hybrid assembly and
published draft assembly as compared to the PacBio assembly are highlighted in red. The key genes in the gap regions are shown by black
markers and intergenic regions are shown by gray markers. The phage region and CRISPR repeats are highlighted on the PacBio assembly in blue
and yellow, respectively. Detail is provided in Table 3. CRISPR, clustered regularly interspaced short palindromic repeats.



settings (C. autoethanogenum unique or C. ljungdahlii 
unique tabs in Additional file 7). This analysis revealed 
that over 10% of the proteome is unique to each bacterium
when comparing C. autoethanogenum (427 proteins
out of 4,134) and C. ljungdahlii (447 out of 4,198).



The 427 proteins unique to DSM 10061 (as
listed by OrthoMCL) were searched against the entire
C. ljungdahlii proteome using BLASTP and an e-value
similarity criterion of 1e-5 to identify proteins with truly
unique function and no homolog, which reduced the
Table 3 Regions of low sequence-coverage
35
Microcompartments protein
30


79
2095013
36


97


Complete


Partial
15


61
rRNA
81
rRNA


2117334
66
tRNA
tRNA_Met
22


64
59


Complete


None
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Table 3 Regions of low sequence-coverage (Continued) 



tRNA 


2135301 


2135374 


tRNA_Met 


133 


17 


57 


Complete 


None 


tRNA 


2135394 


2136466 


tRNA Met 


139 


35 


74 


Complete 


None 


tRNA 


2135478 


2135563 


tRNA Leu 


140 


30 


62 


Complete 


None 


CAETHG 2076 


2220169 


2221506 


Sigma54 specific transcriptional 
regulator, Fis family 


122 


32 


85 


Partial 


Partial 


CAETHG_2077 


2221658 


2221885 


Transcriptional regulator, 
Fis family 


126 


21 


92 


Partial 


None 


CAETHG_2078 


2222014 


2222994 


Putative sigma54 specific 
transcriptional regulator 


135 


30 


77 


Partial 


Partial 


rRNA 


2271738 


2273235 


16s_rRNA 


165 


10 


26 


None 


None 


rRNA 


2273527 


2276414 


23s_rRNA 


158 


10 


26 


None 


None 


tRNA 


2276744 


2276817 


tRNA_Met 


153 


28 


70 


None 


Complete 


rRNA 


2355334 


2356831 


16s_rRNA 


145 


11 


24 


None 


None 


rRNA 


2357123 


2360010 


23s_rRNA 


136 


13 


23 


None 


None 


tRNA 


2360340 


2360412 


tRNA_Lys 


122 


15 


65 


Complete 


Partial 


rRNA 


2372238 


2373735 


16s_rRNA 


128 


13 


21 


None 


None 


rRNA 


2374027 


2376914 


23s_rRNA 


126 


14 


19 


None 


None 


rRNA 


2392702 


2394199 


16s_rRNA 


134 


12 


20 


None 


None 


rRNA 


2394596 


2397483 


23s_rRNA 


142 


11 


21 


None 


None 


CAETHG_2238 


2397706 


2397882 


Hypothetical protein 


138 


23 


57 


Partial 


Complete 


CAETHG_2268 


2424703 


2425503 


Integrase catalytic region 


115 


26 


61 


Complete 


None 


CAETHG_2269 


2425545 


2426123 


Hypothetical protein 


124 


26 


56 


Complete 


None 


Intergenic 


2666300 


2666515 


NA 


145 


25 


69 


Complete 


None 


Intergenic 


2710650 


2710840 


NA 


124 


36 


71 


Complete 


None 


CAETHG_2526 


2714747 


2715550 


Hypothetical protein 


133 


28 


74 


Complete 


Partial 


Intergenic 


2769840 


2769880 


NA 


124 


23 


67 


Complete 


None 


CAETHG_2620 


2822788 


2823741 


Transposase IS66 


124 


30 


59 


Partial 


Complete 


CAETHG_2621 


2823723 


2824328 


Transposase IS66 


127 


30 


52 


Partial 


Partial 


rRNA 


2935186 


2936683 


16s_rRNA 


127 


14 


27 


None 


None 


tRNA 


2936973 


2937045 


tRNA_Ala 


125 


19 


51 


None 


None 


tRNA 


2937053 


2937126 


tRNAJIe 


125 


26 


58 


None 


None 


rRNA 


2937443 


2940330 


23s_rRNA 


117 


14 


28 


None 


None 


rRNA 


2966992 


2968489 


16s_rRNA 


126 


11 


20 


None 


None 


tRNA 


2968779 


2968851 


tRNA_Ala 


132 


20 


50 


None 


None 


tRNA 


2968859 


2968932 


tRNAJIe 


131 


23 


70 


None 


None 


rRNA 


2969222 


2972109 


23s_rRNA 


128 


10 


19 


None 


None 


CAETHG_2843 


3078642 


3079445 


Dihydropteroate synthase DHPS 


152 


30 


66 


Complete 


Partial 


CAFTHG ?R44 


3079499 


30801 31 

Jwuw i ^) i 


Hvnnthptir?)! nrntpin 

i \yvj\j\.\ icru^ai kji^jidii i 


148 


32 


71 


f~omnlptp 


Partial 


CAETHG_2848 


3085939 


3086742 


Dihydropteroate synthase DHPS 


146 


27 


66 


Complete 


Partial 


CAETHG_2849 


3086796 


3087428 


Hypothetical protein 


139 


31 


75 


Complete 


Partial 


CAETHG_3037 


3301321 


3302088 


MCP methyltransferase, CheR-type 


149 


23 


65 


Complete 


Partial 


CAETHG_3075 


3342748 


3343524 


Transposase IS66 


112 


39 


74 


Complete 


Partial 


CAETHG_3281 


3537107 


3537880 


Hypothetical protein 


109 


27 


55 


Complete 


Partial 


CAETHG_3282 


3537862 


3538704 


Ethanolamine utilization protein 


107 


30 


62 


Complete 


None 


CAETHG_3283 


3538721 


3539026 


Microcompartments protein 


103 


20 


65 


Complete 


None 
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Table 3 Regions of low sequence-coverage (Continued) 



CAETHG_3284 


3539020 


3539286 


Ethanolamine utilization protein 
EutN/carboxysome structural 
protein Ccml 


106 


25 


55 


Complete 


None 


CAETHG_3285 


3539304 


3539975 


Ethanolamine utilization EutQ family 
protein 


110 


29 


63 


Complete 


None 


CAETHG_3286 


3540008 


3540784 


Microcompartments protein 


106 


30 


61 


Complete 


None 


CAETHG_3287 


3540833 


3542350 


Acetaldehyde dehydrogenase 
(acetylating) 


111 


27 


61 


Complete 


Partial 


Intergenic 


3848150 


3848350 


NA 


126 


34 


39 


Complete 


None 


rRNA 


3872016 


3873511 


16s_rRNA 


98 


10 


18 


None 


None 


rRNA 


3873937 


3876824 


23s_rRNA 


107 


14 


21 


None 


None 


CAETHG_4028 


4315106 


4316413 


VanW family protein 


98 


24 


66 


Complete 


Partial 


CAETHG_4029 


4316730 


4319132 


Collagen triple helix repeat- 
containing protein 


94 


13 


38 


Complete 


Partial 


CAETHG_4035 


4325792 


4326292 


VanW family protein 


78 


21 


54 


Complete 


Partial 



a The genomic regions which were not assembled in 454/Draft assembly are listed above; b the 'x' coverage defines the 
coordinates; c 'Complete/partial' contig coverage defines whether the region was completely/partially assembled while 
the respective assembly. Missing regions in either 454/Draft assembly are shown in bold. 



raw-read coverage averaged over given 
'None' defines that this region is missing in 



number of dissimilar or unique proteins to 221 (BLASTP 
analysis tab in Additional file 7). From the proteins iden- 
tified as unique to each bacterium, the majority were 
proteins with hypothetical functions or proteins related to 
particular phage, transposon or CRISPR sequences, but 
proteins with key functions in metabolism were also
identified that could explain different phenotypes. These 
differences are discussed below. 
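
The BLASTP filtering step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' script: it assumes the BLASTP results are available in the standard 12-column tabular format (query ID in the first column, e-value in the eleventh), it applies the 1e-5 e-value threshold quoted in the text, and the function name and protein identifiers are hypothetical.

```python
def queries_without_homolog(query_ids, blast_tabular_lines, max_evalue=1e-5):
    """Return query IDs that have no BLASTP hit at or below max_evalue.

    Each input line is assumed to be a tab-separated record with the query ID
    in column 1 and the e-value in column 11 (standard tabular BLAST output).
    """
    with_hit = set()
    for line in blast_tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            with_hit.add(query_id)
    return [q for q in query_ids if q not in with_hit]


if __name__ == "__main__":
    # Hypothetical identifiers and hits, for illustration only.
    candidates = ["protA", "protB", "protC"]
    hits = [
        "protA\tsubj1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
        "protB\tsubj2\t25.0\t60\t45\t2\t1\t60\t5\t64\t0.5\t20.1",
    ]
    print(queries_without_homolog(candidates, hits))
```

In this toy input, only the first protein has a hit that passes the threshold, so the other two would be retained as candidates with no detectable homolog.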

The Wood-Ljungdahl pathway (Figure 2) plays a key
role in acetogenic metabolism by allowing the formation
of acetyl-CoA from CO or CO2, and thus is essential
for autotrophic growth. Under heterotrophic growth conditions
it permits utilization of produced CO2 and reducing
equivalents generated during glycolysis to form an
additional molecule of acetyl-CoA [50]. The genes encoding
the enzymes of the Wood-Ljungdahl pathway are
co-localized in one large cluster (CAETHG_1606-1621).
The same organization is also found in other acetogens
such as C. ljungdahlii [28], C. ragsdalei [28] or C. difficile
[51], but significant differences at the sequence level are
present as described earlier [28,51]. This cluster also includes
the genes for the bifunctional carbon monoxide
dehydrogenase/acetyl-CoA synthase (CODH/ACS) enzyme
complex, the key enzyme in the Wood-Ljungdahl pathway.
As in C. ljungdahlii, two additional monofunctional
carbon monoxide dehydrogenases (CAETHG_3005 and
CAETHG_3899) are encoded in the genome of C. autoethanogenum
that may also be involved in utilization of
CO and CO2. Although CO can be both a carbon and an
energy source for the bacteria, CO2 can only be used as a
carbon source. Additional energy can be generated from
hydrogen, via hydrogenase enzymes. The genome of C.
autoethanogenum encodes six hydrogenases, one (NiFe)
hydrogenase and five (FeFe) hydrogenases. Interestingly,
C. ljungdahlii only has five hydrogenases, lacking one of
the iron-only hydrogenases that are present in C. autoethanogenum.
The genes for this unique (FeFe) hydrogenase
are in an operon with two genes for NuoF-like
oxidoreductases (CAETHG_1575-78). The presence of an
additional hydrogenase enzyme complex could represent a
significant advantage for C. autoethanogenum during autotrophic
growth on CO, CO2 and H2 containing gases. Preliminary
RNA-Seq experiments show that this cluster is
highly expressed under such conditions, underlining the
importance of this enzyme (Additional file 8). Of the other
C. autoethanogenum hydrogenases, a second (FeFe) hydrogenase
gene cluster was also found to be highly expressed.
This nicotinamide adenine dinucleotide phosphate
(NADPH)-specific electron-bifurcating Hyt hydrogenase
was recently characterized and found to form a functional
complex with a formate dehydrogenase [52]. Formate
dehydrogenase activates CO2 to formate in the Wood-Ljungdahl
pathway and additional formate dehydrogenase
genes are present in the C. autoethanogenum genome
(Figure 2). C. autoethanogenum also has a predicted formate
transporter (CAETHG_1601) that is not present in
C. ljungdahlii.

During autotrophic growth, all biomass and products
must be derived from acetyl-CoA from the Wood-Ljungdahl
pathway (Figure 2). Fatty acid biosynthesis starts directly
from acetyl-CoA, whereas production of nucleic acids,
amino acids, vitamins, cofactors and secondary metabolites
proceeds via pyruvate and gluconeogenesis or the TCA
cycle. The C. autoethanogenum genome encodes two
pyruvate:ferredoxin oxidoreductases (PFOR) that catalyze
the conversion of acetyl-CoA into pyruvate. C. autoethanogenum
has a pyruvate, phosphate dikinase (PPDK), but
interestingly no phosphoenolpyruvate synthase (PPSA) and

Figure 2 Inferred metabolism of C. autoethanogenum. Capital letters in brown denote enzymes. ATP, adenosine triphosphate; ADP, adenosine
diphosphate; BDO, 2,3-butanediol; CO, carbon monoxide; CO2, carbon dioxide; FAD, flavin adenine dinucleotide; FADH2, flavin adenine dinucleotide (reduced);
FD_red, ferredoxin (reduced); FD_ox, ferredoxin (oxidized); G3P, 3-phosphoglycerate; GP, glycerone-phosphate; H3PO4, phosphate; NAD, nicotinamide adenine
dinucleotide (oxidized); NADH, nicotinamide adenine dinucleotide (reduced); NADP, nicotinamide adenine dinucleotide phosphate (oxidized); NADPH, nicotinamide
adenine dinucleotide phosphate (reduced); TCA, tricarboxylic acid cycle. Note that reaction directionality has not been rigorously determined; in general,
directionality is as reported in KEGG reactions.
Acetyl-CoA (Wood-Ljungdahl) pathway - Reductive branch. W1 Bifunctional CO dehydrogenase/acetyl-CoA synthase (CODH/ACS) CAETHG_1620-21, 1608-11; W2 Seleno formate dehydrogenase (Fdh) CAETHG_0084, 2789; W3 Non-seleno formate dehydrogenase (Fdh) CAETHG_2988; W4 Formyl-THF ligase (Fhs) CAETHG_1618; W5 Methenyl-THF cyclohydrolase (FchA) CAETHG_1617; W6 Methylene-THF dehydrogenase (FolD) CAETHG_1616; W7 Methylene-THF reductase (MetF) CAETHG_1614-15.
Acetyl-CoA (Wood-Ljungdahl) pathway - Oxidative branch. C Monofunctional CO dehydrogenase CAETHG_3899, 3005; H1 Electron-bifurcating [FeFe] hydrogenase (HytCBDE1AE2) CAETHG_2798; H2 Other [FeFe] hydrogenases (Hyd) CAETHG_0110, 0120, 1576, 3569, 3841; H3 [NiFe] hydrogenase (Hyd) CAETHG_0862; H4 Hydrogenase maturation factor (HypEDCF) CAETHG_0368-0371.
Energy conservation. A F1Fo ATPase (AtpIBEFHAGDC) CAETHG_2342-50; N Electron-bifurcating NADH-dependent Fd:NADP oxidoreductase (Nfn) CAETHG_1580; R Rnf complex (RnfCDGEAB) CAETHG_3227-32.
Acetate fermentation pathway. Ac1 Phosphotransacetylase (Pta) CAETHG_3358; Ac2 Acetate kinase (Ack) CAETHG_3359.
Ethanol fermentation pathway. E1 Bifunctional aldehyde/alcohol dehydrogenase (AdhE) CAETHG_3747, 3748; E2 Aldehyde:ferredoxin oxidoreductase (AOR) CAETHG_0092, 0102; E3 Additional alcohol dehydrogenases (Adh) CAETHG_0555.
2,3-butanediol fermentation pathway. B1 Acetolactate synthase (AlsS) CAETHG_0124-25, 0406, 1740; B2 Acetolactate decarboxylase (BudA) CAETHG_2932; B3 2,3-butanediol dehydrogenase (Bdh) CAETHG_0385.
Lactate fermentation pathway. L Lactate dehydrogenase (Ldh) CAETHG_1147.
Central pyruvate metabolism. P1 Pyruvate:ferredoxin oxidoreductase (PFOR) CAETHG_0928, 3029; P2 Pyruvate, phosphate dikinase (PPDK) CAETHG_2055, 2909; P3 Pyruvate kinase (Pk) CAETHG_2440-41; P4 Pyruvate carboxylase (Pyc) CAETHG_1594; P5 PEP carboxykinase (PEPCK) CAETHG_2721; P6 Malic enzyme CAETHG_0605, 1055.
Incomplete TCA cycle. T1 Citrate synthase CAETHG_2751; T2 Citrate lyase CAETHG_1052-54, 1898-1901, 2480-83; T3 Aconitase (Aco) CAETHG_1051, 2752; T4 Isocitrate dehydrogenase (Idh) CAETHG_2753; T5 Malate dehydrogenase (Mdh) CAETHG_1702, 2478, 2689; T6 Fumarase CAETHG_1902-03, 2062, 2479; T7 Fumarate reductase CAETHG_0344, 1032, 2961.
Glycolysis/Gluconeogenesis. PTS Fructose phosphotransferase system (PTS) CAETHG_0142, 0676-73; G1 Fructokinase (Fk)/Fructose-6-phosphate isomerase CAETHG_0166, 0156; G2 1-phosphofructokinase (Pfk1) CAETHG_0143; G3 6-phosphofructokinase (Pfk6) CAETHG_648, 2439; G4 Fructose bisphosphate aldolase (Aldo) CAETHG_2382; G5 Triose-phosphate isomerase (Tpi) CAETHG_1758; G6 Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) CAETHG_1760, 3424; G7 Phosphoglycerate kinase (Pgk) CAETHG_1759; G8 Phosphoglycerate mutase (Pgm) CAETHG_712, 1757; G9 Enolase/phosphopyruvate hydratase (Eno) CAETHG_1756.



C. ljungdahlii contains two such enzymes (CLJU_c14340
and CLJU_c38600). The rest of gluconeogenesis is similar
in both organisms. C. autoethanogenum has an incomplete
TCA cycle that runs to succinate and 2-oxoglutarate (Figure 2).
C. autoethanogenum products such as 2,3-butanediol and
lactate are also derived from pyruvate [28], whereas ethanol is
produced from acetyl-CoA via acetaldehyde, either directly
via bifunctional aldehyde/alcohol dehydrogenases or via
acetate using phosphotransacetylase, acetate kinase and
an aldehyde:ferredoxin oxidoreductase (Figure 2). Several
additional alcohol dehydrogenases are present in the
genome of C. autoethanogenum.

Heterotrophic growth on a range of other substrates, such
as C5 and C6 sugars, has been described for
C. autoethanogenum [27]. A PTS system and the other
respective genes could be identified in the genome (Figure 2). In
contrast to C. ljungdahlii, some extra genes involved in
mannose metabolism are present in C. autoethanogenum,
as well as genes for aromatic compound degradation.
C. autoethanogenum also has an additional predicted nitrate
reductase (CAETHG_0085), and the two organisms differ in
some of their transport systems.

Other differences between C. autoethanogenum and
C. ljungdahlii include variations in the sporulation program,
with several unique predicted sporulation proteins and
regulators present in C. autoethanogenum strain DSM
10061, and different defense systems, such as
restriction/methylation systems and a CRISPR system that is present
in C. autoethanogenum but not in C. ljungdahlii. Insertion
sequence (IS) elements are usually unique to a strain, yet
one found in C. autoethanogenum between 4,345,780
and 4,347,448 bp is 100% identical to one in
C. ljungdahlii. C. autoethanogenum DSM 10061 was enriched
from rabbit feces in Belgium [27] and C. ljungdahlii DSM
13528 was isolated in the US from chicken yard waste
[47]. Despite the geographical separation of the isolates,
the overall degree of similarity between C. ljungdahlii and
C. autoethanogenum suggests a common ancestor.

C. autoethanogenum CRISPR system 

CRISPR are prokaryotic DNA loci that carry a memory
of past infections by phages and plasmids and
provide immunity against these mobile genetic elements [53,54]. In
the last decade, several studies have unraveled the molecular
details and mechanisms of action of CRISPR defense
[53,55,56]. Briefly, CRISPR loci are composed of arrays of
24 to 47 bp partially palindromic, highly conserved repeats
separated by variable spacers specific to the infecting
DNA. CRISPR-associated (cas) genes are involved in spacer
acquisition, expression and interference against phages or
plasmids. cas gene operons are classified into three types
and several subtypes, and can target DNA, RNA,
or both [53]. CRISPR and cas gene operons are proposed
to be transferred between distantly related strains by
horizontal gene transfer and/or by transposons [57], and the
latter can be identified by the presence of insertion
elements and transposase/mutase genes in their vicinity. Thus,
CRISPR appear to be dynamic, fast-evolving heritable defense
systems in bacteria against plasmids and phages that
play important roles in the co-evolution of
both bacteria and phages.

The genome of C. autoethanogenum contains
eight cas genes of Type I-B, all predicted to lie in one
operon on the antisense strand with a predicted transcription
terminator at the end of the cas2 gene. The operon is
flanked by three CRISPR arrays (Table 4, Additional files 9
and 10) with a total of 93 30-bp repeats (consensus
5'-GTTGAACCTCAACATGAGATGTATTTAAAT-3') and
90 spacers of 35 to 38 bp. All three CRISPR arrays are
preceded by a 177-bp leader sequence; such leaders are essential for
array transcription, and a fragment of the leader sequence
is co-transcribed with the array [58,59]. The leader
sequences of the three putative C. autoethanogenum CRISPR
arrays share 82 to 91% sequence similarity
(Additional file 9). Interestingly, 10 kb downstream
of the three CRISPR arrays, an incomplete leader
sequence of 65 bp was found that has high sequence identity
to the other leaders and lies in close proximity (100 bp) to an
imperfect CRISPR repeat
(5'-GTTGAACCTtAACATGAGATGTAaaggtAa-3'). In addition to the three CRISPR
arrays flanking the cas genes, a putative extra CRISPR
array was identified in the genome, consisting of three
55-bp repeats and two 16-bp spacers (Additional file 10).
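
The repeat bookkeeping in this paragraph can be checked with a few lines of Python. This is only an illustrative sketch: the two sequences are the ones quoted above, while the function names are hypothetical and not part of the study's pipeline. It reports where the imperfect repeat departs from the consensus and confirms that three arrays holding 93 repeats are expected to hold 90 spacers, since an array of n repeats encloses n - 1 spacers.

```python
# Consensus repeat and imperfect repeat, as quoted in the text.
CONSENSUS = "GTTGAACCTCAACATGAGATGTATTTAAAT"
IMPERFECT = "GTTGAACCTtAACATGAGATGTAaaggtAa"


def mismatch_positions(a: str, b: str) -> list[int]:
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i + 1 for i, (x, y) in enumerate(zip(a.upper(), b.upper())) if x != y]


def expected_spacers(n_repeats: int, n_arrays: int) -> int:
    """An array of n repeats encloses n - 1 spacers, so subtract one per array."""
    return n_repeats - n_arrays


if __name__ == "__main__":
    print(len(CONSENSUS))                            # repeat length in bp
    print(mismatch_positions(CONSENSUS, IMPERFECT))  # positions that differ
    print(expected_spacers(93, 3))                   # spacers across three arrays
```

The positions reported coincide with the lower-case letters used in the text to mark the imperfect repeat.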



Expression of cas genes and CRISPR arrays, along with
their leader sequences, was studied by reverse
transcriptase PCR (RT-PCR) and RNA-Seq during logarithmic
growth under autotrophic conditions. PCR amplification
of fragments of the expected sizes was observed only with
cDNA template and not with RNA template, showing the absence
of genomic DNA contamination in the RNA preparations
(Additional file 9). All eight predicted cas genes appear to
be co-expressed from a single operon. Expression of
spacers distal to the leader sequences in all three arrays
was also assessed. Based on sequence similarity between
the three leader sequences, a common reverse primer was
designed to align to the conserved region in the leader
sequences of all three arrays, together with forward primers aligning
specifically to spacers proximal to the leader sequence in
each array (Additional file 9). Both the leader-proximal
and distal spacers of array 2 were found to be expressed,
whereas expression was detected only for spacers
proximal to the leader sequence in array 3, and no expression
was detected from array 1 or the identified extra leader.
Preliminary RNA-Seq data showed expression of all
CRISPR RNAs (crRNAs), with different abundances. A
few transcripts corresponding to the leader region of the
three CRISPR arrays were also detected. Similar to the cas
gene operon, the three CRISPR arrays including their



Table 4 Overview of CRISPR systems, plasmids and prophages in fuel-producing Clostridium species

| Category | Organism | Genome size (Mb) | Status | Reference | Plasmid(s) (number) | Plasmid size (kb) | Prophages (number)* | Prophage(s) size (kb) | CRISPR arrays (number)* | CRISPR repeats (number) |
|---|---|---|---|---|---|---|---|---|---|---|
| Solventogenic (ABE) Clostridia | C. acetobutylicum ATCC824 | 3.94 | complete | [60] | 1 | 192 | 3 | 191 | | |
| | C. acetobutylicum EA2018 | 3.94 | complete | [61] | 1 | 192 | 3 | 191 | | |
| | C. acetobutylicum DSM1731 | 3.94 | complete | [62] | 2 | 201 | 3 | 191 | | |
| | C. beijerinckii NCIMB8052 | 6.00 | complete | | | | 4 | 106 | | |
| | C. saccharobutylicum DSM 13864 | 5.10 | complete | [63] | | | 5 | 133 | 4 | 55 |
| | C. saccharoperbutylacetonicum N1-4 | 6.53 | complete | [64] | 1 | 136 | 6 | 161 | 4 | 67 |
| Cellulolytic Clostridia | C. cellulolyticum H10 | 4.07 | complete | | | | 8 | 210 | 3 | 23 |
| | C. cellulovorans 743B | 5.26 | complete | [65] | | | 5 | 179 | 3 | 44 |
| | C. thermocellum ATCC27405 | 3.84 | complete | [66] | | | 5 | 222 | 5 | 442 |
| | C. thermocellum DSM1313 | 3.56 | complete | [67] | | | 1 | 26 | 5 | 189 |
| | C. phytofermentans ISDg | 4.85 | complete | | | | 1 | 28 | | |
| Acetogenic Clostridia | C. autoethanogenum DSM 10061 | 4.35 | complete | this study | | | 4 | 115 | 3 | 95 |
| | C. ljungdahlii DSM 13528 | 4.63 | complete | [28] | | | 6 | 248 | | |
| | C. carboxidivorans P7 | 5.59 | 251 contigs | | 1 | 20 | 1 | 19 | | |
| | | 4.40 | 69 contigs | [68] | 1 | 20 | | | | |

*Please refer to Additional file 10: Table S9 for details.

leader regions were transcribed from the antisense
strand. The three CRISPR arrays were found to be
constitutively transcribed and processed into crRNAs of
varying lengths (Additional file 8). The processed
crRNAs nevertheless appeared to have a well-defined
eight-nucleotide 5' handle, 5'-ATTTAAAT-3', originating
from the repeat region and followed by the spacer sequence
(Additional file 8). The 3' ends of these processed
crRNAs carried varying tags (Additional file 8). Based on
these findings, a scheme for CRISPR processing in
C. autoethanogenum is proposed (Additional file 8). Similar
processing of crRNA was observed across the three
samples collected at different time points.
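
The reported 5' handle matches the last eight nucleotides of the consensus repeat, which a short sketch makes explicit. The model below is an assumption-laden illustration and not the processing scheme of Additional file 8. It assumes the handle is the 3'-terminal 8 nt of the upstream repeat, and it represents the variable 3' tag by a hypothetical `tag_length` parameter that takes the first nucleotides of the downstream repeat. Sequences are written in the DNA alphabet, as in the text.

```python
REPEAT = "GTTGAACCTCAACATGAGATGTATTTAAAT"  # consensus repeat quoted in the text
HANDLE_LENGTH = 8                          # length of the reported 5' handle


def five_prime_handle(repeat: str, length: int = HANDLE_LENGTH) -> str:
    """Return the last `length` nucleotides of the repeat."""
    return repeat[-length:]


def model_crrna(spacer: str, repeat: str = REPEAT, tag_length: int = 0) -> str:
    """Model a processed crRNA as handle + spacer + 3' tag.

    tag_length is a hypothetical parameter: the text only states that the
    3' tags vary, so the tag is modelled here (as an assumption) as the
    first `tag_length` nucleotides of the downstream repeat.
    """
    return five_prime_handle(repeat) + spacer + repeat[:tag_length]


if __name__ == "__main__":
    print(five_prime_handle(REPEAT))
    print(model_crrna("ACGT" * 9, tag_length=3))  # toy 36-nt spacer, 3-nt tag
```

Under these assumptions, cleavage falls between positions 22 and 23 of the 30-bp repeat; the spacer used in the example is a made-up placeholder.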

CRISPR spacer sequences in C. autoethanogenum were
analyzed to identify potential target DNA sequences. A
BLAST search returned no high-identity hits, either
within the National Center for Biotechnology Information
(NCBI) database or against the C. autoethanogenum genome itself. A
comparison of regions of DNA from putative C. autoethanogenum
processed crRNAs from all three arrays identified the
sequence 5'-ATTTAAAT-3' (Additional file 9), which is
similar to sequences from Clostridium thermocellum [69],
Methanococcus maripaludis [69], Escherichia coli [70] and
Pyrococcus furiosus [71], which also have Type I-B CRISPR
systems. In these organisms the processing of crRNA is
mediated by the cas6 gene, which is also found in
C. autoethanogenum. Unlike in C. thermocellum [69],
M. maripaludis [69], Sulfolobus acidocaldarius [72] and P. furiosus
[73], C. autoethanogenum crRNAs were transcribed only
from the antisense strand, and no anti-crRNA transcripts
originating from the complementary strand were detected.

Identification and classification of CRISPR systems in
industrially relevant Clostridia

The presence of a CRISPR system in C. autoethanogenum
but not in C. ljungdahlii could provide an advantage
in industrial fermentations. The C. autoethanogenum
CRISPR system was compared to those from other
industrially relevant Clostridia strains to better understand their
characteristics and their potential physiological and
applied roles. In particular, the Clostridial ABE fermentation
process has a history of phage infections [74]. CRISPR
systems from 14 Clostridium species were examined for
the first time, including those used in ABE fermentation
processes (C. acetobutylicum, C. beijerinckii,
C. saccharobutylicum and C. saccharoperbutylacetonicum), the
cellulose-degrading C. thermocellum, C. cellulolyticum, C. cellulovorans
and C. phytofermentans, and the acetogens
C. autoethanogenum, C. ljungdahlii and C. carboxidivorans. CRISPR
elements were identified in only 8 of the 14 Clostridium
species analyzed by PILER [75] and CRISPRdb [76]. All of
the loci were found on chromosomes and none on any
plasmids or megaplasmids (Table 4). Of the ABE
fermentation Clostridia examined, only C. saccharobutylicum
DSM 13864 has a CRISPR system; several strains
of the more commonly used C. acetobutylicum,
C. beijerinckii and C. saccharobutylicum do not. This may be one of the
reasons why the ABE fermentation process was historically
found to be prone to phage infections [74]. Of the three
acetogenic strains investigated, only C. autoethanogenum
had a CRISPR system, whereas all analyzed cellulolytic
Clostridia except C. phytofermentans contain CRISPR
systems.

In all Clostridium species that harbor CRISPR arrays,
cas genes were identified. C. cellulolyticum and
C. thermocellum had two and four different cas operons, respectively
(Additional files 10 and 11). These cas operons were
classified based on a recently proposed classification system
[53] and their target molecule(s) inferred. A phylogenetic
analysis of cas1 genes was performed and compared to the
16S rRNA phylogeny (Additional file 12). In
C. cellulolyticum, arrays 1 and 2 are associated with the Type I-C cas
system and Adb with the Type II cas system. The two
arrays are separated by transposase and mutase genes
(Additional file 11). C. cellulolyticum has two different sets
of cas genes, both of which appear to target DNA. The
C. cellulovorans cas operon could not be classified
according to these criteria, nor could the target of its cas genes
be inferred. C. thermocellum appears to have a Type III cas
gene system (Additional file 11). The Type III cas system
contains more than one type of cas gene operon belonging
to either Type I or II or the repeat-associated mysterious
proteins (RAMP) module operon, and is predicted to
target both DNA and RNA [53]. Arrays 3 and 4 in
C. thermocellum are associated with the Type I-B cas system
[69], and arrays 5 and 6 with a cas system similar to Type I-B
but interrupted by insertion of multiple other genes that
separate the cas1, cas2 and cas4 genes (possibly involved in spacer
acquisition) from cas3, 5, 7 and 8b (predicted to be
involved in DNA interference) (Additional file 11). Array
1 is not associated with any cas gene cassette and is
flanked by integrase and mutase genes at the 3' end.
Apart from these two cas systems, C. thermocellum also
has a RAMP module gene cassette that may be involved in
RNA interference. This cassette is not associated with any
array and could be acting in trans. As in a few lactic acid
bacteria [57], integrase and mutase genes were frequently
found near the cas gene cassette, particularly flanking the
Type I cas gene cassette found in C. cellulovorans,
C. thermocellum and C. autoethanogenum, suggesting possible
horizontal gene transfer. These genes were also found next
to array 1 in C. thermocellum.

The C. autoethanogenum CRISPR repeat DNA was not
found in any of the other Clostridium species included in
this study. A search of the CRISPRdb database [76] for organisms
with repeats similar to those of C. autoethanogenum returned
Clostridium novyi and Eubacterium limosum, along with a
few Clostridium botulinum substrains. A comparison of the
repeat sequences showed very high sequence similarity
(Additional file 12). The cas gene operons in
C. autoethanogenum, C. novyi and E. limosum were all of Type I-B,
whereas the C. botulinum substrains had different sets of
cas genes. The cas gene operon architecture, the
arrangement of arrays on the chromosome and the presence of
two hypothetical genes separating arrays 2 and 3 in
C. autoethanogenum and C. novyi are strikingly alike, suggesting a
common lineage of these two CRISPR-cas systems. This
observation was further strengthened by the phylogenetic
classification, which placed the C. autoethanogenum cas1 gene
together with cas1 genes from C. novyi and E. limosum and
apart from the other Clostridium species (Additional file
12). Even though the repeat and the cas gene operon in
C. autoethanogenum, C. novyi and E. limosum are largely
identical, no similarity was found between the spacers.

Comparison of strains with/without the CRISPR system to 
plasmid and prophage content 

Correlation between the presence of CRISPR and the
occurrence of prophages or plasmids has been reported [77].
To assess the 14 Clostridium species for this correlation,
their genome sequences were analyzed for the presence of
potential prophage regions. In C. autoethanogenum, four
putative prophages were identified: an incomplete prophage
similar to a Singapore grouper iridovirus, an intact
prophage similar to a Geobacillus E2 virus and two intact
prophages inserted into tRNAs (Additional file 10). One
prophage was identified in a Trp-tRNA and the other in an
Arg-tRNA. The latter is also present in almost identical form
in C. ljungdahlii, suggesting a shared lineage.
Prophage regions were detected in all species irrespective of
the presence of CRISPR modules (Table 4 and Additional
file 10). Although there seems to be no general trend, and
it cannot be determined whether a prophage infection
occurred before or after a CRISPR system was acquired, in a
few cases bacteria that lacked CRISPR systems appeared
to have more abundant prophage sequences.

When looking for plasmid content, only one out of 
seven strains containing CRISPR systems was found to 
contain a plasmid. Likewise, only one out of five plasmid- 
carrying strains contained a CRISPR system. CRISPR- 
mediated immunity has been shown experimentally to 
block conjugative plasmid acquisition [78], although the 
role of CRISPR in driving plasmid and phage evolution for 
industrially relevant Clostridia and other microorganisms 
remains to be fully elucidated. 
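
The plasmid and CRISPR tallies quoted above can be re-derived from Table 4. The sketch below assumes each of the 14 strains is reduced to two booleans transcribed from the table (a plasmid is listed or not, CRISPR arrays are listed or not); the tuple layout and the `tally` function are illustrative only.

```python
# (strain, has_plasmid, has_crispr), transcribed from Table 4.
STRAINS = [
    ("C. acetobutylicum ATCC824", True, False),
    ("C. acetobutylicum EA2018", True, False),
    ("C. acetobutylicum DSM1731", True, False),
    ("C. beijerinckii NCIMB8052", False, False),
    ("C. saccharobutylicum DSM 13864", False, True),
    ("C. saccharoperbutylacetonicum N1-4", True, True),
    ("C. cellulolyticum H10", False, True),
    ("C. cellulovorans 743B", False, True),
    ("C. thermocellum ATCC27405", False, True),
    ("C. thermocellum DSM1313", False, True),
    ("C. phytofermentans ISDg", False, False),
    ("C. autoethanogenum DSM 10061", False, True),
    ("C. ljungdahlii DSM 13528", False, False),
    ("C. carboxidivorans P7", True, False),
]


def tally(strains):
    """Count strains carrying a CRISPR system, a plasmid, both, or neither."""
    counts = {"crispr": 0, "plasmid": 0, "both": 0, "neither": 0}
    for _name, has_plasmid, has_crispr in strains:
        counts["crispr"] += has_crispr
        counts["plasmid"] += has_plasmid
        counts["both"] += has_plasmid and has_crispr
        counts["neither"] += not (has_plasmid or has_crispr)
    return counts


if __name__ == "__main__":
    print(tally(STRAINS))
```

With only 14 strains the counts are descriptive; they are too few to support a statistical claim about association.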

Conclusions 

A comparative genomic analysis revealed that short-read
technologies were unable to resolve the C. autoethanogenum
DSM 10061 repeat regions, which are largely associated with nine
copies of the rRNA gene operons. A previous study
suggested that long single-molecule reads are sufficient to
assemble most known microbial genomes, based on a
bioinformatics analysis of 2,267 complete genomes for
bacteria and archaea and sequencing results for six bacteria
[14]. The genome sequence of C. autoethanogenum DSM
10061 is classified within the most complex class of
bacterial genomes, and a complete genome sequence was
generated for it using long single-molecule reads and
without the need for manual finishing. The relatively
low cost of generating the PacBio data (approximately
US $1,500) and the outcome of this study support the
assertion that this technology will be valuable in future studies
where a complete genome sequence is important and for
complex genomes that contain large repeat elements.

Clostridia are known for their substrate and metabolic
flexibility, which makes them attractive biocatalysts for
biofuel and biorefinery applications [79]. Acetogenic
Clostridia, such as C. autoethanogenum, are of interest due to
their ability to ferment abundant syngas or waste gases to
useful products [29]. The C. autoethanogenum genome
sequence will facilitate strain development for biofuels and
biochemicals production and comparative genomics in
the future. A comparison between C. autoethanogenum
and C. ljungdahlii identified distinct differences, notably
the presence of a CRISPR system, an additional
C. autoethanogenum hydrogenase, and several differences in
central metabolism, although the two bacteria likely descend
from a common ancestor. Comparative genomic analysis
and characterization of CRISPR, plasmid content and
prophages among Clostridia of biotechnological interest was
performed. Notably, the classic ABE fermentation strains
C. acetobutylicum and C. beijerinckii are reported to be
prone to bacteriophage infections [63] and all lack a
CRISPR system. Only one of the 14 analyzed strains
contains both a plasmid and a CRISPR system. Of the
acetogenic Clostridium strains sequenced to date, only
C. autoethanogenum possesses a CRISPR system. Further
consideration of Clostridia CRISPR systems may be
informative for bioprocess development strategies and for
ecological studies.

Methods 

DNA sequence data generation 

C. autoethanogenum strain JA1-1 was obtained from the
Deutsche Sammlung von Mikroorganismen und Zellkulturen
(DSMZ) culture collection (DSM 10061) and
cultured in PETC medium as
described [28]. A single colony was purified and its 16S rDNA
sequence confirmed before genomic DNA was prepared.
High molecular weight genomic DNA was prepared as
described earlier [28], quantified with a NanoDrop ND-1000
spectrophotometer (NanoDrop Technologies, Wilmington,
DE, USA), and its quality was assessed with an Agilent Bioanalyzer
(Agilent, Santa Clara, CA, USA).

Pyrosequencing was conducted using the Roche 454
GS FLX System (Roche 454 Life Sciences, Branford, CT,
USA) with paired-end DNA library
preparation, average insert sizes in the 3-kb range and
Titanium chemistry, according to the manufacturer's
instructions and as described previously [38,80]. Sequence data
were also generated using a MiSeq instrument (Illumina,
San Diego, CA, USA) [16] and a paired-end approach
with an approximate library insert size of 500 bp and
read lengths of 151 bp, as described previously [81] and
according to the manufacturer's instructions. DNA for
PacBio sequencing was sheared with G-tubes (Covaris,
Inc., Woburn, MA, USA), targeting 20-kb fragments.
PacBio libraries were prepared with the DNA Template
Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA)
and library fragments above 4 kb were isolated using the
Blue Pippin system (Sage Science, Inc., Beverly, MA,
USA). The average PacBio library insert size (including
adapters) was approximately 19 kb, and samples were
sequenced using Magbead loading, C2 chemistry,
polymerase version P4, and software version 2.02. Raw
next-generation sequence data are available through the NCBI SRA
database [SRX352885; SRX352888; SRP030033]. PCR
and Sanger sequencing were conducted using standard
approaches as described previously [82], and primer
sequences are listed in Additional file 1.

Sequence data trimming, filtering, annotation and 
assembly 

The CLC Genomics Workbench (version 6.0.2) (CLC bio,
Cambridge, MA, USA) was used to trim and filter
Illumina reads for quality and to generate the subsequent
Illumina assembly. The Newbler application (version 2.8)
in the 454 GS FLX software package (Roche 454 Life
Sciences) was used to assemble reads generated from the
GS FLX instrument, alone and in combination with reads from
the Illumina instrument, as described previously [38]. The
consensus Illumina sequences were processed before being
input into the Newbler assembler by generating 1.5-kb
overlapping fake reads using the fb_dice.pl script, which is
part of the FragBlast module
(http://www.clarkfrancis.com/codes/fb_dice.pl). The PacBio reads were assembled
through SMRTanalysis v2.0 (Pacific Biosciences) using
the HGAP protocol [20]. The DSM 10061 PacBio
assembly was annotated using the Prodigal gene-calling
algorithm [43] and deposited in the NCBI database [GenBank:
CP006763].
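
The idea of dicing consensus sequences into overlapping 1.5-kb fake reads can be sketched as follows. This is not a port of fb_dice.pl, whose options are not described here. The `dice` function, the 750-bp step and the choice to anchor the last window at the sequence end are assumptions made for illustration; only the 1.5-kb read length comes from the text.

```python
def dice(sequence: str, read_length: int = 1500, step: int = 750) -> list[str]:
    """Cut a sequence into overlapping windows of `read_length`, advancing by `step`.

    A final window is anchored at the sequence end so that no bases are dropped.
    Sequences no longer than `read_length` are returned as a single read.
    """
    if read_length <= 0 or not 0 < step <= read_length:
        raise ValueError("need read_length > 0 and 0 < step <= read_length")
    if not sequence:
        return []
    if len(sequence) <= read_length:
        return [sequence]
    last_start = len(sequence) - read_length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [sequence[s:s + read_length] for s in starts]


if __name__ == "__main__":
    contig = "ACGT" * 1000  # toy 4-kb contig, not real data
    reads = dice(contig)
    print(len(reads), {len(r) for r in reads})
```

A step smaller than the read length is what produces the overlap; a 50% overlap is used here only as an example value.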

Assessment of genome assembly quality 

The in silico evaluation of genome assemblies was
performed using CGAL (version 0.9.6) [13], REAPR (version
1.0.16) [11], QUAST (version 2.2) [12] and Circos [83].
The genomic repeats were identified using Nucmer [84];
genome complexity was determined based on the count and
length of the repeats, as suggested earlier [14]. Gaps in the
454/Illumina hybrid and published draft assemblies were
determined by performing multiple genome alignment
with Mauve (version 2.3.1) [85], using the PacBio assembly
as the reference genome. The order of contigs in the 454/
Illumina hybrid assembly and the alignment of Sanger
sequences were determined using Geneious software (version
6.1.5) (Biomatters, Auckland, New Zealand).
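
Among the summary statistics that tools such as QUAST report for an assembly is the N50 contig length. As a reminder of what that number means, here is a minimal sketch of the usual definition; it is not QUAST's implementation, and the contig lengths in the example are made up.

```python
def n50(contig_lengths):
    """Return the largest length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

For a closed single-contig assembly, N50 simply equals the genome length, which is why fragmented short-read assemblies score much lower on this measure.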

Analysis, classification and comparison of CRISPR, 
plasmid, and prophage content in C. autoethanogenum 
and other fuel-producing Clostridia 

The genome of C. autoethanogenum (NC_022592) and
genome sequences of C. acetobutylicum ATCC824
(NC_003030), DSM1731 (NC_015687) and EA2018
(NC_017295), C. beijerinckii NCIMB8052 (NC_009617),
C. saccharobutylicum (NC_022571), C. saccharoperbutylacetonicum
(NC_020291), C. cellulolyticum H10 (NC_011898),
C. cellulovorans 743B (NC_014393), C. thermocellum ATCC27405
(NC_009012) and DSM1313 (NC_017304),
C. phytofermentans ISDg (NC_010001), C. ljungdahlii DSM13528
(NC_014328) and C. carboxidivorans (ACVI01000000;
ADEK01000000) were retrieved from NCBI GenBank. The
genome sequences of all these organisms were analyzed for
CRISPR repeats using the PILER algorithm [75] and
CRISPRdb [76]. The sequences of plasmids found in
C. acetobutylicum ATCC824 (NC_001988), DSM1731
(NC_015686 and NC_015688) and EA2018 (CP002119),
C. saccharoperbutylacetonicum (NC_020292), and C. carboxidivorans
(NC_014565) were also analyzed for the presence of
CRISPR loci. The repeat sequences detected by PILER and
CRISPRdb were combined and manually compared at the
sequence level. Degenerate CRISPR repeats with several
mismatches were not taken into account. The genome
sequences of these species were also analyzed for prophage
regions using PHAST [86] and Phage_Finder [87], and the
program outputs were analyzed manually. Multiple sequence
alignments of repeats and their sequence logos were
generated using CLUSTALW and Weblogo. Phylogenetic
analyses based on 16S rRNA and cas1 genes were made using
Geneious software. The phylogenetic tree was constructed
using the neighbor-joining method with 100 bootstrap
replicates.
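
Once a repeat sequence is known, the spacers of an array are simply the stretches lying between consecutive repeat copies. The sketch below shows this with exact string matching. It is a deliberately simplified stand-in and is not how PILER or CRISPRdb detect arrays: it needs the repeat as input and ignores degenerate repeats. The function name and the toy spacers are hypothetical.

```python
def extract_spacers(array_seq: str, repeat: str) -> list[str]:
    """Return the sequences lying between consecutive exact copies of `repeat`."""
    positions = []
    start = array_seq.find(repeat)
    while start != -1:
        positions.append(start)
        # Continue searching after the current repeat copy.
        start = array_seq.find(repeat, start + len(repeat))
    return [
        array_seq[left + len(repeat):right]
        for left, right in zip(positions, positions[1:])
    ]


if __name__ == "__main__":
    repeat = "GTTGAACCTCAACATGAGATGTATTTAAAT"  # consensus repeat from the text
    spacers = ["ACGTACGTACGTACGTACGTACGTACGTACGTACG",      # toy 35-nt spacer
               "TTGCATTGCATTGCATTGCATTGCATTGCATTGCATTG"]   # toy 38-nt spacer
    toy_array = repeat + spacers[0] + repeat + spacers[1] + repeat
    print(extract_spacers(toy_array, repeat))
```

Because matching is exact, a repeat copy with even one mismatch would be treated as part of a spacer, so real arrays call for a mismatch-tolerant search.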

RT-PCR 

RT-PCR was performed to study the expression and
operon structure of cas genes and the expression of CRISPR
arrays. Briefly, RNA was isolated (RNeasy Mini Kit, Qiagen,
Valencia, CA, USA) from 20 mL of C. autoethanogenum
culture growing in serum bottles at an optical density at 600 nm (OD600)
of 0.2 in PETC medium with steel mill waste gas
(composition: 42% CO, 36% N2, 20% CO2, and 2% H2; collected
from a New Zealand Steel site in Glenbrook, New Zealand)
as the sole energy and carbon source. cDNA was
synthesized using 500 ng of DNase I (Ambion Inc., Austin,
TX, USA)-treated RNA, SuperScript III reverse
transcriptase and random primers (Life Technologies, Grand Island,
NY, USA). PCR was set up with 30 ng of cDNA or DNase
I-treated RNA (control) as template and iProof DNA
polymerase (Bio-Rad, Hercules, CA, USA). The primers used in
this study are listed in Additional file 9.

RNA-Seq 

RNA-Seq was performed on C. autoethanogenum
growing in continuous culture in a 1.5-L continuous stirred
tank reactor (CSTR) with steel mill waste gas (composition:
42% CO, 36% N2, 20% CO2, and 2% H2; collected from a
New Zealand Steel site in Glenbrook, New Zealand) as the
sole energy and carbon source, as described previously [52].
A 20-ml sample was centrifuged at 4,000 rpm for 10
minutes at 4°C. The supernatant was discarded and the pellet
was stabilized by adding 5 ml of RNAlater® (Ambion Inc.).
Total RNA was isolated from the cell pellet using the
RiboPure™-Bacteria Kit (Ambion Inc.) according to the
manufacturer's standard protocol. DNA was removed
using the TURBO DNA-free kit (Ambion Inc.) and RNA
quality was assessed using a 2100 Bioanalyzer (Agilent
Technologies). RNA concentration was determined with a
NanoDrop 2000 (Thermo Fisher Scientific, Waltham, MA,
USA). Ribodepletion was conducted using the
MICROBExpress™ kit (Ambion Inc.). cDNA libraries were prepared
and sequenced by standard procedures using SOLiD2
sequencing technology. Output from the SOLiD run was
processed in LifeScope v2.5.1, as specified by the manufacturer
(http://downloads.lifetechnologies.com/Analysis_Software/GS/LifeScope/v2.5.1/LifeScope-v2.5.1_4476538_AUG.pdf).
Processed reads were mapped to a reference assembly,
and the resulting BAM files were imported, displayed, and
manually inspected in the Geneious genome browser,
v7.0.3 (Biomatters).

Additional files 



Additional file 6: Clusters of Orthologous Groups (COG) analysis.
Number of genes associated with COG functional categories for the DSM
10061 PacBio assembly.

Additional file 7: Identification of C. autoethanogenum from C.
ljungdahlii orthologs. The OrthoMCL algorithm [49] was used for
ortholog analysis. The 1_taxa tab contains putative paralogs, the 2_taxa file
contains putative orthologs, the Unique tab contains unique proteins using
default settings, and the Table 1 file is a general descriptor for all proteins. The
427 C. autoethanogenum proteins from the Unique tab were compared to the C.
ljungdahlii proteome using BLASTP, and 221 proteins were considered
dissimilar or unique to DSM 10061 based on e-value scores of <1e-5.

Additional file 8: RNA-Seq data for hydrogenase operon
CAETHG_1575-78 and clustered regularly interspaced short
palindromic repeats-associated (CRISPR-cas) array of C.
autoethanogenum. Mapped RNA-Seq reads for the [FeFe] hydrogenase
operon CAETHG_1575-78 and the CRISPR-cas system of C. autoethanogenum.
Processing of crRNA in C. autoethanogenum.

Additional file 9: Reverse transcriptase (RT)-PCR data for C.
autoethanogenum clustered regularly interspaced short
palindromic repeats-associated (CRISPR-cas) system. Organization
and expression of the C. autoethanogenum CRISPR system as determined
by RT-PCR, and primer list for RT-PCR.

Additional file 10: Clustered regularly interspaced short
palindromic repeats (CRISPR) arrays and prophage regions in
analyzed genomes. Overview of position and sequences of identified
CRISPR arrays and prophages in industrially relevant Clostridia.

Additional file 11: Graphical representation of clustered regularly
interspaced short palindromic repeats-associated (CRISPR-cas) loci
in Clostridium species.

Additional file 12: Phylogenetic classification of industrially relevant
Clostridia based on 16S rRNA and clustered regularly interspaced
short palindromic repeats (CRISPR) systems. Classification of the
CRISPR-cas system of C. autoethanogenum and phylogenetic classification
based on 16S rRNA and cas1 genes.